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The Fisher Information matrix is a widely used measure for applications ranging from statistical 
inference, information geometry, experiment design, to the study of criticality in biological systems. 
Yet there is no commonly accepted non-parametric algorithm to estimate it from real data. In 
this rapid communication we show how to accurately estimate the Fisher information in a non- 
parametric way. We also develop a numerical procedure to minimize the errors by choosing the 
interval of the finite difference scheme necessary to compute the derivatives in the definition of 
the Fisher information. Our method uses the recently published “Density Estimation using Field 
Theory” algorithm to compute the probability density functions for continuous densities. We use 
the Fisher information of the normal distribution to validate our method and as an example we 
compute the temperature component of the Fisher Information Matrix in the two dimensional Ising 
model and show that it obeys the expected relation to the heat capacity and therefore peaks at the 
phase transition at the correct critical temperature. 

PACS numbers: 02.60.-x,05.I0.-a,05.50.+q,64.60.-i 


Consider a probabilistic description of a system as a 
probability density function (PDF) p{x;0). The observ¬ 
ables of the system are collected as parts of the vector 
X and the PDF will typically depend on a vector of con¬ 
tinuous parameters 0. It is interesting how the system 
responds to parameter changes, e.g. at phase transitions 
or system dynamics near bifurcations. A measure of the 
sensitivity of the PDF to the parameters is the Fisher 
Information Matrix (FIM) [I]. It is a symmetric matrix 
labeled by the parameters of the PDF. If a small param¬ 
eter change causes a large change in the PDF, the corre¬ 
sponding entry in the FIM will be large. Much literature 
discusses the different uses of the FIM, its interpretation 
as a Riemmanian metric on the statistical manifold [2] 
and its relation to theories of phase transitions [3-12] 
and complex systems [13-19]. 

Since the FIM is directly computed from the PDF, its 
accurate estimation depends on the general density esti¬ 
mation problem [20]. Density estimation aims to obtain 
the best estimate Qest of a distribution Qtme given N 
independently drawn samples. We distinguish between 
parametric and non-parametric estimation. Parametric 
estimates constrain Qest fo depend on a few parameters 
that are estimated from the data [21]. By the Cramer- 
Rao inequality [I] the inverse of the Fisher information 
(FI) is a lower-bound on the variance of the estimated pa¬ 
rameters. FI is therefore often computed in the paramet¬ 
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ric setting. In these cases the FI is computed analytically 
from the assumed function. 

When we do not assume a specific form for the PDF, 
we estimate the density non-parametrically. Thus, the 
data determines the shape of the distribution. Areas 
with higher probability density will contain more data 
points than areas with lower probability. The main prob¬ 
lem of non-parametric methods is how to balance the 
goodness of fit to the data and the smoothness of the 
estimated curve [20]. For example, kernel density es¬ 
timators (KDEs) are a sum over kernel functions with 
width h, positioned at each data point, i.e. Qest(^) = 

N 

{hN)~^ K[{x — Xi)/h] where Xi is a data point and 

i=l 

iF is a kernel function. The bandwidth h controls the 
smoothness of the estimate. In the limit h ^ the esti¬ 
mate is a sum of delta functions at each data point; in the 
^ oo it is uniform. Choosing the correct bandwidth 
is therefore important. Taken too large, the estimate 
will hide crucial features. Too small a bandwidth causes 
spurious peaks in the estimate, especially for long-tailed 
distributions [20]. Important to this study, the amount 
of smoothing directly affects the value of the FI. This can 
be seen from its definition: 

g^AO) = {{d^,\np){d^\np )), ( 1 ) 

which depends on the derivatives of the PDF. Here = 
djdO^, p is a PDF and the average is with respect to 
p. If, e.g., the estimated PDF Qest is smoother than the 
true PDF Qtme, the estimate for the FI will be smaller 
than the true FI. 

One elegant approach that derives the smoothness 
from the data itself was proposed in [22]. The authors 
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used field theory to formulate the notion of a smooth¬ 
ness scale as an ultraviolet cutoff, treating the smooth¬ 
ness length scale ^ as a parameter in a Bayesian infer¬ 
ence procedure. They showed that, in the large N limit, 
the data selects an appropriate length scale. Recently, 
this method was developed into a fast and accurate al¬ 
gorithm called DEFT (Density Estimation using Field 
Theory) [23]. The algorithm was implemented in one 
and two dimensions, since it suffers from the “curse of 
dimensionality” [23]. 

Previous work on the nonparametric estimation of FI 
primarily dealt with PDFs with location-like parameters, 
i.e. p{x; 0) = p{x — 0). There Huber [24] found a unique 
interpolation of the cumulative distribution function that 
maximizes the FI. Kostal and Pokora [25] adapted the 
maximized penalized likelihood method of Goodd and 
Gaskins [26] to compute the FI. Kostal and Pokora re¬ 
jected the use of KDE for the direct computation of the 
FI because no appropriate bandwidth parameter to con¬ 
trol of the p'/p term in (1) is known [25]. In this work we 
estimate PDFs from samples drawn from the normal dis¬ 
tribution at different parameter values using both DEFT 
and KDE with a Gaussian kernel, and compare the FI re¬ 
sults to the analytic solution. We obtain accurate results 
using DEFT, which are an improvement over using KDE. 
We also use DEFT to compute the FI in the two dimen¬ 
sional Ising model showing the computation is accurate 
also at the phase transition. 

To proceed we replace the derivatives in Eq. (1) with 
centered finite-difference derivatives: 


9^1 




p{x; 9 + AO^) — p{x; 9 — A9^) 


2A9f^ 

p{x;9 ^ A9^) — p{x;9 — A9^) dx 


/ 


2A6»'^ p(x;e) 

\a.p{x\ 9 + A6*'') — lnp(a;; 9 — A9^) 


2A6>#‘ 

\n.p{x\ 9 + A9'^) — \n.p{x] 9 — A9'') 
2A6»^ 


(2a) 


(2b) 


p{x] 9)dx . 


Here A9^ indicates a change in the value of only one 
parameter 9^ keeping all other parameters fixed, i.e., 9-\- 
A9^ = (6>^,..., + A9 ^,..., 9^). The error introduced 

by this replacement is proportional to 0{A9‘^/6) (for each 
derivative) as can be verified by Taylor expansion. Higher 
order finite-difference schemes can be used but not, in our 
experience, a lower order one-sided derivative because the 
estimate does not converge to the true value (data not 
shown). 

This introduces a new free parameter: the size of the 
difference A9^. The value of A9^ strongly influences 
the accuracy of the computation. Two sources of error 
determine the optimal A9^: the aforementioned numer¬ 
ical derivative error and the finite sample size N. The 
first error, which scales like 0[(A^^)^], decreases with 
decreasing A9. The second source, however, decreases 
with increasing A9^. This happens because an esti¬ 
mate from a finite number of samples is always under¬ 
determined. Any estimate is one curve from a group of 


curves that are close, but not equal, to the true density. 
The larger the number of samples, the smaller the size of 
the group. If A9^ is too small, the groups of the densities 
in the numerical derivatives will overlap and the differ¬ 
ence p(x; 6>+A6>^)—p(x; 6>—A6>^) will be ill-defined. This 
leads to one of our main results: since A9^ cannot be too 
small or too large, there is an optimal value with minimal 
error between the two extremes. 

To estimate the curve group size and avoid overlaps, 
we use large deviations theory. According to Sanov’s 
theorem [27] the appropriate distance measure is the 
Kullback-Leibler (KL) divergence: 

Vkl[Q\\P]= j Q{x)\n‘^dx (3) 

x^X 


which is defined for two densities P{x) and Q{x) where 
the support of P and Q overlap. The probability that a 
set of N samples independently drawn from P appears 
to be drawn from Q is proportional to: 

exp{-NVKL[Q\\P]). (4) 

In the limit of infinite sample size this tends to zero. 
For finite N the set of distributions whose KL-divergence 
with P is small enough such that this probability is fi¬ 
nite forms the curve group. This can be interpreted as a 
hypersphere in parameter space centered at 9 with an N 
and 9 dependent radius. To minimize error, the radius 
at 9 and 9-\- A9^ should be small compared to A9^. The 
ideal case is drawn schematically in Fig. la with well sep¬ 
arated densities and in Fig. lb where A9^ is too small. 
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(a) Well separated densities. 
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(b) Overlapping densities. 


->• 


FIG. 1. Schematic drawing in one dimension with points of 
estimation 9 and 9 -h A9. The gray area is the hypersphere. 
£ is the radius of the hypersphere in units of A9. 


To compute hypersphere radius we take P = p{x; 9) 
and Q = p{x]9 + eA9^) in Eq. (4). We thus seek the 
density Q at the edge of the hypersphere and parametrize 
it with 5, the hypersphere radius in units of A9^. The 
KL-divergence of two neighbouring distributions is ap- 
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proximately: 


VKL[Pm\P{^ + ~ = 0{Ad^) 

( 5 ) 

where we use the Einstein summation convention for re¬ 
peated indices [28]. Inserting Eq. (5) into Eq. (4) we get 


exp 


Ne^ 


g^,{0)A0^A0^ 


(6) 


We define the boundary of the hypersphere as the point 
where the probability is equal to e~^. Equating Eq. (6) 
with we obtain the radius 5: 


Ng^,{0)A0^A0- ' ' ' 

The radius depends on the number of samples N, 0, and 
AO^. At a given N and 6>, increasing AO^ will decrease 
the radius and thus increase accuracy. 

As an analytically solvable example, we take the uni¬ 
variate normal distribution A/'(/i, a). Its El is 

1 2 

QjjLiJL 1 9(j(j 5 9^0- 9(j^i 0 • ( 3 ) 

We focus on the El of a, which is not a location param¬ 
eter. Inserting this to Eq. (7) yields 


with Act = AO^. This guides the choice of Act for a given 
A, cr, and desired radius e. We can get the same result 
using the Cramer-Rao inequality. The minimal variance 
of an unbiased estimator for a is l/^fcrcr- Given N sam¬ 
ples this equals . Demand that the variance of a 

is equal to ^(^Acr)^ (the factor ^ ensures a consistent 
definition of e). This variance is equivalent to a hyper¬ 
sphere radius of sAa. We then have cr^/2A = {sAa)‘^/2. 
Solving for Act yields Eq. (9). 

Eor real data the El is unknown and can be estimated 
iteratively. Eirst compute the El with AO^ that ensure a 
good approximation of the numerical derivatives. Then 
use the El to compute 5. If it is too large (5 « 0.1 seems 
to be reasonable), increase AO^ or N. 

We demonstrate our main results by computing g^j^- 
from independently drawn normally distributed samples. 
We first compare DEET (with number of grid points 
G = 100, smoothness parameter a = 3 and a bounding 
box twice the interval between the smallest and largest 
sample [23]) and KDE (using Scott’s rule for the band¬ 
width). We used both with the same samples and com¬ 
puted the El from Eq. (2). In the top plot of Eig. 2 the El 
estimate is shown. The black curve is the analytic value, 
the green dots and blue x ’s are the median estimate after 
100 repetitions (error bars are 5 and 95 percentiles) for 
DEET and KDE respectively. We used A = 10^ for each 
density estimate. We used 5 = 0.05 since this yields the 
best results. 
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FIG. 2. (Color online) A comparison between Gaussian KDE 
and DEFT for density estimation. Top figure shows median 
FI estimates using both methods. Error bars represent 5 and 
95 percentiles. Bottom figure shows the relative errors. The 
values were computed with N = 10^, £ = 0.05, and 100 rep¬ 
etitions at each a. The same samples were used by both 
methods. 


Both methods seem to follow the analytic curve, how¬ 
ever from the relative errors it is clear that KDE con¬ 
sistently overestimates the El by about 40% and the 
distance between 5 and 95 percentile is about 100% of 
the original value. DEET has zero bias and a spread of 
30% — 40%. We conclude that DEET provides an im¬ 
provement over KDE both in the estimated value and in 
the error margins. In the above computations we used 
Eq. (2a) for computation with DEET and Eq. (2b) for 
KDE, because KDE was extremely unstable when com¬ 
puted using Eq. (2a) while DEET performed slightly bet¬ 
ter with Eq. (2a). 

In the following we use DEET exclusively for the den¬ 
sity estimation. To see how the error depends on 5 we 
vary it at a fixed A = 2 x 10^ and plot the relative er¬ 
ror. We computed the El for a = 0.5,1,2,5,10. Each 
computation was repeated 100 times at different 5 and 
the median and 5 and 95 percentiles of the relative error 
{[gaa—FI]/gcra^ whcrc FI is the estimated El) were com¬ 
puted. All curves have the same functional dependence 
on 5 and, as we predicted, there is an optimal value for 
Act, at 5 ^ 0.05. Thus the errors depend on a through 
the combination in Eq. (7), as shown in Eig. 3. All the 
curves have a minimum in the range of 5 G [0.04, 0.1]. At 
small 5 they grow due to errors in the numerical deriva¬ 
tive (Act too large). At large 5 they grow due to over¬ 
lapping densities. The spread (the 90% inter-percentile 
range) is minimal at 5 = 0.05 as well. The shaded re¬ 
gions in the plot represent the inter-percentile range of 
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the various a curves. 



FIG. 3. (Color online) The median relative error as a function 
of £ for different values of cr. FI stands for the computed value 
and Qaa the analytic value. The shaded areas and error bars 
in the inset indicate the 5 and 95 percentiles computed over 
100 repetitions of the computation with N = 2 x 10^. 


To verify the N and 5 dependence of the errors we var¬ 
ied both and computed The result is presented as 
a heat map in Fig. 4. The color represents the absolute- 
value relative estimation error in logarithmic scale. The 
dashed line indicates the s = 0.1 line which represents 
the highest value of 5 where good results are still ob¬ 
tained. The dash-dotted line represents the Aa = 0.35 
line. All computations were done with a = 1.0 and 100 
repetitions. The errors due to small Act seem to follow 
the e = 0.1 curve, showing again the dependence of this 
type of error on 5. Above Aa = 0.35 we see increasing 
errors due to the large value of Aa. The best area for 
the estimation is between the two lines. 

One of the applications of the computation of FI from 
samples is in detecting phase transitions [17]. As a fur¬ 
ther validation we took the two dimensional Ising model, 
which is the prototypical model of a continuous phase 
transition. It is a model of binary spins Si on a square 
lattice with nearest-neighbors interaction. Its Hamilto¬ 
nian is 


77 — ^ ^ Jijh ^ ^ Si , ( 10 ) 

(hi) * 


where (i, j) indicates the sum is on nearest neighbors, 
= ±1 is the value of a spin at site i, Jij is the in¬ 
teraction energy, and h is an external applied magnetic 
field. In more than one dimensions there is a critical 
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FIG. 4. (Golor online) Relative error in the computation of 
the FI for cr = 1.0 as a function of both Aa and N. Gom- 
puted using DEFT with 100 repetitions per point. Dashed 
line represents the £ = 0.1 line and the dash-dotted line is the 
Act = 0.35 line. Unlike the previous plots, here we compute 
the absolute value relative error to avoid problems with the 
logarithmic color-bar scale. 


order-disorder phase transition at a finite temperature. 
Onsager solved the model exactly in two dimensions in 
the thermodynamic limit (infinite number of spins) and 
at zero applied external field [29]. The critical tempera¬ 
ture in the isotropic case {Jij = J) is 

2J 

Te = -^ 2.269J. (11) 

ln(l + y2) ^ ^ 

For simplicity we set J = 1 and Boltzmann’s constant 
ks = 1- 

Prokopenko et. al. [17] computed both the TT and hh 
components of the FI (computed for the Gibbs distribu¬ 
tion with 0^ = h and 0^ = T) in terms of the susceptibil- 
ity xt and the specific heat Ch and showed that: 

Ch Xt 

qtt = ^; ghh = ^ • (i^j 

We therefore expect both to diverge as the system ap¬ 
proaches the critical temperature. In a finite system this 
means that the FI peaks at the critical temperature. 

To validate this result we simulate the Ising model 
and compute the FI. We used the Metropolis-Hastings 
Monte Carlo algorithm to obtain samples of the configu¬ 
ration energy with the Gibbs distribution (at zero exter¬ 
nal field): 

p{S;T) = ^exp[-l3n{S,T,h = 0)]- (13) 

Here p = 1/T is the inverse temperature, S = 

is a configuration of the spins on a L x L 
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square lattice, and Z is the partition function. We then 
estimate the TT component of the FI using Eq. (2) with 
densities estimated from the sampled energies. We also 
computed the specific heat: 

Cn{T) = ^ {{E^) - {Ef) (14) 

where is the total number of spins, E is the energy 
of the configuration, and the average is performed over 
different configurations at the same temperature. 

We plot the result of both the FI and the specific heat 
Ch computation in Fig. 5. The simulation was run on a 
25 X 25 lattice of spins with periodic boundary conditions 
in the temperature range [0.5,4.0] which we divided into 
200 segments, leading to parameter difference of dT ^ 
0.17. We repeated the simulation 5 times and compute 
the median and 5 and 95 percentiles. We used a warm-up 
period of 5 x 10^ time steps and took N = 15, 000 samples 
of the configuration energy. We used DEFT (with G = 
200, a = 3 and a bounding box of [—4,1]) for the density 
estimation. Because the FI depends on T, 5 was not 
constant. Its median was e = 0 . 12 +q;J 7 for the values of 
5 which were not infinite. To verify that Eq. (12) holds, 
we plot the ratio of the two sides of the equation. This 
is presented in the inset in Fig. 5. 
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FIG. 5. (Color online) Blue continuous curve is the TT com¬ 
ponent of the FIM and the green dots are the heat capacity 
in the 2D Ising model on a 25 x 25 grid. Shaded blue and 
green regions indicates the 5 and 95 percentiles computed 
from 5 simulations. Inset shows the ratio of FI to heat capac¬ 
ity {gTTT‘^Cy^L~‘^) which according to Eq. (12) is equal to 
1 . 

There are several technical points we would like to 
mention about the implementation of the method. First, 
we performed the same computation with a smaller grid 
spacing {dT = 0.007). This led to a much worse signal-to- 
noise ratio because the very close densities caused large 
peaks to occur, especially in the low temperature range. 


Second, it is important to find the most suitable param¬ 
eters for DEFT. If the bounding box is too small, or the 
number of grid points too small or too large, the esti¬ 
mated density will have multiple peaks which are not 
apparent in the data. Thus we recommend plotting the 
result of DEFT together with a histogram for several data 
points to make sure the convergence is good. Third, in 
the computation of Eq. (2a) the term 1/p may contribute 
large values at very small p. Equivalently with Eq. (2b), 
when p{x\0 ± AO) are small, their logarithm will again 
be large. This requires the introduction of a numerical 
cutoff. It is common practice to set the contribution of 
a term where p{x) = 0 to zero [18]. We thus introduced 
a cutoff such that if any of the estimates at a particu¬ 
lar point is less than the cutoff, the contribution of this 
point to the integral will be zero. We investigated the ef¬ 
fect of this cutoff for a range of values between 10“^^ and 
10“^. The value of the cutoff had very little effect. In 
the Ising model, the only effect was to change the size of 
the low temperature region where the FI is exactly zero 
(the lower the cutoff, the smaller the region was). In pro¬ 
ducing Fig. 5 we used a value of 10“^^. Lastly we would 
like to mention that the plots in Fig. 5 were obtained by 
the use of Eq. (2b). 

As is clear by the remarks above, care should be taken 
when using this method to compute the FI. One should 
first make sure a good convergence of DEFT is achieved, 
by adjusting G, a and the bounding box. Then make sure 
to select the correct parameter difference A6>, a decision 
that can be aided by the estimation of the 5 parame¬ 
ter. And if necessary, use a cutoff for very low values of 
the probability density. Since we rely on DEFT to per¬ 
form the density estimation, the procedure is limited by 
the limitations of DEFT. It is especially important to 
note that so far DEFT has been implemented in 1 and 
2 dimensions. Higher dimensions suffer from the “curse 
of dimensionality” since they require exponentially many 
grid points to evaluate the density. 
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